-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Updating the JIT to handle the FMA hardware intrinsics #18105
Conversation
} | ||
else | ||
{ | ||
// TODO-XArch-CQ: Technically any one of the three operands can |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CarolEidt, will there be any problems if we set op1, op2, and op3 as regOptional? Will the JIT pick just one, or will it actually try to spill all three?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have to pick just one. #6361 is the issue that tracks allowing multiple operands to be specified as regOptional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I'm going to update the comment with a link to the bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
{ | ||
for (var i = 1; i < RetElementCount; i++) | ||
{ | ||
if (BitConverter.DoubleToInt64Bits(Math.Round((firstOp[i] * secondOp[i]) + thirdOp[i], 13)) != BitConverter.DoubleToInt64Bits(Math.Round(result[i], 13))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of the FMA tests do the same thing, but I am going to explicitly call it out here.
We don't have a good way to test FMA today due to the way the operation is executed. Technically, the operation does:
temp = firstOp * secondOp
temp = temp + thirdOp
result = round(temp)
This is in contrast to the two individual operations which do:
temp = firstOp * secondOp
result = round(temp)
temp = result + thirdOp
result = round(temp)
This is probably the "easiest" solution right now.
Also FYI for any reviewers. The tests are auto-generated from a template. The first commit (~650 Lines) contains the product changes and needs actual review. The second commit (the remaining 12k lines) contains only test changes and probably does not need in depth review (outside of checking the template input data and looking at one or two tests as an example). |
I still dont understand why the rounding should be a problem. GCC does use FMA for a*b+c by default. |
@dangi12012, Only if you:
Additionally, as per the IEEE 754:2008, automatically transforming If such automatic optimizations were provided, they would need to be behind a switch that users can enable or disable as required. Edit: It may also be worth mentioning that GCC is the only major compiler doing this (you have to opt-in using |
Rebased onto dotnet/master to pickup #18078 |
I think you could test the FMA computation by applying the TwoProduct and GrowExpansion routines from this paper to compute an exact result, then round it to a single/double value: https://people.eecs.berkeley.edu/~jrs/papers/robust-predicates.pdf |
@tannergooding Why don't you split PR into 2 commits: (i) JIT changes, (ii) Tests - to be consistent with our previous work? |
@4creators, that isn't consistent with my previous HWIntrinsic work and it isn't consistent with standard practice. Generally speaking, you shouldn't add new product code without corresponding tests exercising said code. As, otherwise, you don't have any validation that your code is correct and functioning properly. It is completely fine, however, to split the product code and tests into two separate commits. For discoverability, ease of review, etc. |
@Jorenkv, thanks for the reference. There are definitely many ways in which we can "exactly" validate the result. However, this may be more effort than it is worth and I would like to hear from @CarolEidt/@eerhardt before investing significant time in doing that. The general rule, so far, has been that the HWIntrinsics are contracted to:
There are of course certain exceptions, such as the "helper" intrinsics (which don't map to any particular instruction) or certain intrinsics (like The current validation logic ensures that we are emitting some form of MultiplyAdd instruction without worrying (too much) about the underlying implementation/behavior of said instruction (we are basically just validating we didn't encode the instruction wrong). |
OK So as usual you are contradicting yourself just to have different opinion 😄, AFAIR you explicitly asked me to submit PRs as two separate commits: (i) commit changing product code -> JIT, (ii) commit adding tests. Should I dig up your specific request? |
Yes, two commits (which is exactly the shape of the current PR). I may have misread your original comment as I thought you were asking why this isn't split into two PRs. |
@tannergooding Ahh, OK I see how this issue arised. |
Ok, test failures are all in the no AVX runs. FMA failures are because the ISA isn't filtered out when AVX is disabled (which, today, basically controls the VEX support that FMA requires). The remaining failures look to have been introduced by #16517 (the last non-PR run doesn't yet include my most recent changes to They all look to be similar in error: |
Rebased onto master to pick up #18116 |
Logged https://github.com/dotnet/coreclr/issues/18119 for the remaining, unrelated, failures. |
PR resolving the other test failures is here: #18120 |
@dotnet-bot test Windows_NT x64 Checked jitincompletehwintrinsic @dotnet-bot test Windows_NT x86 Checked jitincompletehwintrinsic @dotnet-bot test Ubuntu x64 Checked jitincompletehwintrinsic |
@CarolEidt, all issues should now be resolved. |
@tannergooding Ah of course. You're not doing software fallback so then there's no point in checking if the computation is precisely correct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've looked through all the code, and have just a couple of comment/naming suggestions - as well as a couple of things to consider for future.
I plan to look at the encodings (instrxarch.h) just to double-check against the arch manual.
src/jit/compiler.cpp
Outdated
@@ -2606,6 +2582,34 @@ void Compiler::compSetProcessor() | |||
} | |||
} | |||
} | |||
|
|||
// There are currently two sets of flags that control AVX, FMA, and AVX2 support | |||
// This is the general EnableAVX flag and the individual ISA flags. We need to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I would end the first line with ':' and start the second line with "These are ..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -5360,12 +5455,85 @@ void emitter::emitIns_SIMD_R_R_I(instruction ins, emitAttr attr, regNumber reg, | |||
} | |||
} | |||
|
|||
void emitter::emitIns_SIMD_R_R_R_A( | |||
instruction ins, emitAttr attr, regNumber reg, regNumber reg1, regNumber reg2, GenTreeIndir* indir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this (and subsequent), I would name the first reg argument targetReg
or dstReg
or something. Unlike some of the other methods, this one is designed only to support the case where the first argument is the dest, so it would be good to be descrptive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I submit a separate PR fixing up all the register names here? Currently the majority are regNumber reg
where it should be regNumber targetReg
and it would be nice to fix them all up (I could also do it in this PR, if you think that is fine).
src/jit/gentree.cpp
Outdated
@@ -17422,6 +17422,16 @@ bool GenTree::isRMWHWIntrinsic(Compiler* comp) | |||
switch (AsHWIntrinsic()->gtHWIntrinsicId) | |||
{ | |||
case NI_SSE42_Crc32: | |||
case NI_FMA_MultiplyAdd: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that perhaps we should just use a flag for this - especially if there are more intrinsics that will have this behavior. As I think I've mentioned before I don't like negative booleans - so I would prefer changing the existing flag to something like HW_Flag_SSE_RMWSemantics
and then adding HW_Flag_RMWSemantics
. But that is a topic for another discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that perhaps we should just use a flag for this - especially if there are more intrinsics that will have this behavior.
I don't think we have that many of them (there are very few RMW intrinsics for the VEX encoding). I'll add a comment indicating we should revisit if this grows any more.
As I think I've mentioned before I don't like negative booleans - so I would prefer changing the existing flag to something like HW_Flag_SSE_RMWSemantics and then adding HW_Flag_RMWSemantics. But that is a topic for another discussion.
Right, I think we have an issue tracking this. -- I think the primary problem right now is that the majority case is positive, and it is easier to track/add the minority case (whether positive or negative).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a TODO-XArch-Cleanup
comment here.
varNum = tmpDsc->tdTempNum(); | ||
offset = 0; | ||
|
||
compiler->tmpRlsTemp(tmpDsc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possible comment for future: if there isn't a method to do this (I couldn't find it off-hand), perhaps there should be (returning a varNum):
tmpDsc = getSpillTempDsc(op3);
varNum = tmpDsc->tdTempNum();
compiler->tmpRlsTemp(tmpDsc);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a TODO-XArch-Cleanup
comment
@fiigii - would you have time to also have a look at this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks! And sorry it took me a while to get to this.
@CarolEidt Sorry, I have no time to look at the PR in detail until 6/10. Please feel free to move forward if it looks good to you. |
@fiigii - no problem, but thanks for the quick response. |
This resolves https://github.com/dotnet/coreclr/issues/18081
FYI. @CarolEidt, @fiigii, @eerhardt